Alexander Feldman V.1.0
This project is carried out with the aim of opening a new robot-run cafe in Los Angeles.
Research objectives:
#!pip install plotly --upgrade
#!pip install pandas_profiling
#!pip install usaddress
import pandas as pd
import numpy as np
import scipy.stats as st
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly import graph_objects as go
import plotly.figure_factory as ff
import pandas_profiling as pp
import usaddress
path_local_work = 'datasets/rest_data_us.csv'
path_platform_work = '/datasets/rest_data_us.csv'
try:
rests = pd.read_csv(path_platform_work)
except:
rests = pd.read_csv(path_local_work)
#pp.ProfileReport(rests)
The quality of data is fine. There are three missing values in chain column
# Rename columns
rests.columns=['id', 'name', 'address', 'chain', 'type', 'seats']
rests.head()
# Fill missing values as False (not chain), because studying the name of objects say us that it rather
# not chain establishments
rests['chain'] = rests['chain'].fillna(False)
rests['address'] = rests['address'].str.upper()
Investigate the proportions of the various types of establishments. Plot a graph.
group_types = rests.groupby('type', as_index=False)['id'].count()
fig = px.pie(group_types, values='id', names='type', title='Proportions of the various types of establishments',
color_discrete_sequence=px.colors.qualitative.Set1)
fig.show()
As we can see from the graph more than 75% of establishments are restaurants. In the second place is fast food - 11%. The remains have shares of less than 5%.
Investigate the proportions of chain and nonchain establishments. Plot a graph.
group_chain = rests.groupby('chain', as_index=False)['id'].count()
group_chain['chain'] = group_chain['chain'].replace({False:'Nonchain', True:'Chain'})
fig = go.Figure(data=[go.Pie(labels=group_chain['chain'],
values=group_chain['id'],
textinfo='label+percent',
insidetextorientation='radial',
hole=.3
)])
fig.update_traces(marker=dict(colors=['rgb(255,127,0)','rgb(166,86,40)']))
fig.update_layout(title='Proportions of chain and nonchain establishments', showlegend=False)
fig.show()
Almost 40% of places are chain. Globalization is coming.
Which type of establishment is typically a chain?
chain_rests = rests[rests['chain'] == True]
chain_rests_types = chain_rests.groupby('type', as_index=False)['id'].count().sort_values('id', ascending=False)
fig = px.bar(chain_rests_types, x='type', y='id', title='Chain establishment types', color='type',
labels={'type':'Type of establishment', 'id':'Number of establishments'}, text='id',
color_discrete_sequence=px.colors.qualitative.Set1)
fig.update_traces(texttemplate='%{text}', textposition='outside', showlegend=False)
fig.show()
In general for chains also save the trend for proportions of various types of establishments. However, the share of bakeries is higher for chains and the share of bars is lower.
What characterizes chains: many establishments with a small number of seats or a few establishments with a lot of seats?
# Build the graph
x_values = pd.Series(range(0, len(chain_rests)))
fig = px.scatter(chain_rests, x=x_values, y='seats', color='type', title='Number of seats in chain establishments',
labels={'x':'Count of establishments', 'seats':'Number of seats'},
color_discrete_sequence=px.colors.qualitative.Set1)
fig.show()
# plot the histogram
hist_data = [chain_rests['seats']]
group_labels = ['Distribution of seats'] # name of the dataset
fig = ff.create_distplot(hist_data, group_labels)
fig.update_layout(title='Distribution of seats in chain establishments')
fig.show()
print('The average of number of seats for chains establishments is {:.1f}. The median value is {:.1f}'.format(
chain_rests['seats'].mean(), chain_rests['seats'].median()))
As we can see on the scatter plot the range of the number of seats from 0 to 250. There is a clear division of density for several areas. The first area is less than 49 seats, the second one is from 50 to 150 seats and the third is more than 150 seats. The histogram also lets to notice the additional area is less than 29 seats.
Perhaps there are some features of the regulation of establishments with a different number of seats.
The average number of seats for chains establishments is 39.7. Since the distribution is not normal the median value has a significant difference.
The median value is 25.0.
chain_rests['group_seats'] = np.where(chain_rests['seats']<=29,
'<29',
np.where((chain_rests['seats']>29) & (chain_rests['seats']<49),
'30-49',
np.where((chain_rests['seats']>49) & (chain_rests['seats']<149),
'50-149', '>150')
)
)
group_seats = chain_rests.groupby('group_seats', as_index=False)['id'].count()
fig = px.pie(group_seats, values='id', names='group_seats',
title='Proportions of the various number of seats of chain establishments',
color_discrete_sequence=px.colors.qualitative.Set1)
fig.show()
The largest part of establishments has less than 29 seats (58,8%). On the 2d place, the area from 30 to 49 seats (22,3%). On the 3d and 4th, areas up to 149 seats and from 150 seats respectively (13,8% and 5,06%).
Determine the average number of seats for each type of restaurant. On average, which type of restaurant has the greatest number of seats? Plot graphs.
# Plot the graph
plt.figure(figsize=(10,6))
ax = sns.barplot(x="type", y="seats", data=rests)
plt.title('Average number of seats by type of establishments', fontdict={'size':15})
plt.xlabel('Type of establishment')
plt.ylabel('Number of seats')
plt.gca().spines["top"].set_alpha(0.0)
plt.gca().spines["bottom"].set_alpha(0.5)
plt.gca().spines["right"].set_alpha(0.0)
plt.gca().spines["left"].set_alpha(0.5)
plt.show()
As we can see from the bar chart, the average values for different types of places (both chain and conventional) are from 20 to 50 seats. Moreover, restaurants, bars and fast foods have an average value of 30 to 49 seats, and bakery and pizza cafes up to 29 seats.
Put the data on street names from the address column in a separate column.
# make a function for separating streets from adress
def cleaning_street(raw):
# treat exceptions
if raw.startswith('OLVERA'):
clean_street = 'OLVERA ST'
elif raw.startswith('1033 1/2 LOS ANGELES'):
clean_street = 'LOS ANGELES ST'
elif raw.startswith('308 WESTWOOD PLZ'):
clean_street = 'WESTWOOD PLZ'
elif raw.startswith('5333 Z00'):
clean_street = 'ZOO DR'
elif raw.startswith('5333 ZOO'):
clean_street = 'ZOO DR'
elif raw.startswith('724 VINE'):
clean_street = 'VINE ST'
# main process
else:
raw_address=usaddress.parse(raw)
clean_street = ''
for i in raw_address:
if i[1] == 'StreetNamePreDirectional' or i[1] == 'StreetName' or i[1] == 'StreetNamePostType':
clean_street = clean_street +' '+ str(i[0])
return clean_street
rests['street'] = rests.address.apply(cleaning_street)
rests.sample(10)
Plot a graph of the top ten streets by number of restaurants.
streets_rest = rests.groupby('street', as_index=False)['id'].count()
streets_rest = streets_rest.rename(columns={'id':'count'})
top_ten_streets = streets_rest.sort_values('count', ascending=False).head(10)
fig = px.bar(top_ten_streets, x='count', y='street', title='Top ten streets by number of establishments',
color='street', text='count', orientation='h',
color_discrete_sequence=px.colors.qualitative.Set1)
fig.update_traces(texttemplate='%{text}', textposition='auto')
fig.show()
print('The share of the top ten streets restaurants is {:.0%} from total number of restaurants of LA.'.format(
top_ten_streets['count'].sum() / len(rests)))
The top 10 streets have a huge number of restaurants. There are almost 30% of the city's establishments.
Find the number of streets that only have one restaurant.
streets_rest['one_rest'] = np.where(streets_rest['count']==1, 'Streets with only one place', 'Streets with many places')
print('We have {} streets with only one place from {} total number of streets'.format(
streets_rest[streets_rest['count']==1]['street'].count(), len(streets_rest)))
group_one = streets_rest.groupby('one_rest', as_index=False)['street'].count()
fig = px.pie(group_one, values = 'street', names='one_rest', color='one_rest',
title='Proportions of the streets with only one place and many places',
color_discrete_sequence=px.colors.qualitative.Set1)
fig.update_traces(textinfo='value + percent')
fig.show()
As we can see from the pie chart, 43,5% of streets of LA have only one establishment.
For streets with a lot of restaurants, look at the distribution of the number of seats. What trends can you see?
rests_top_streets = rests[rests['street'].isin(top_ten_streets['street'])]
# Plot the histogram of distribution of number seats for the top ten streets.
plt.figure(figsize=(18,6))
sns.kdeplot(data=rests_top_streets, x='seats', hue='street', fill=True)
plt.title('The distribution of the number of seats for top 10 streets', fontdict={'size':15})
plt.xlabel('Number of seats')
plt.axvline(x=28, color='red', linestyle='--', alpha=.5)
plt.gca().spines["top"].set_alpha(0.0)
plt.gca().spines["bottom"].set_alpha(0.3)
plt.gca().spines["right"].set_alpha(0.0)
plt.gca().spines["left"].set_alpha(0.3)
plt.show()
print('Tha average number of seats for the top ten streets establishments is {:.2f}. The median value is {:.2f}'.format(
rests_top_streets['seats'].mean(), rests_top_streets['seats'].median()))
To compare these results with other streets, let's divide the street into three groups:
rests['group'] = np.where(rests['street'].isin(streets_rest[streets_rest['count']>=213]['street']),
'A',
np.where(rests['street'].isin(streets_rest[streets_rest['count']<=20]['street']),
'C', 'B')
)
# make violin plot for three groups:
fig = px.violin(rests, y="seats", x="group", box=True, points="all", color='group',
title='The distribution of the number of seats by streets group')
fig.show()
At first glance, the distributions are very similar to each other.
Make the test for calculating the statistical significance of average number of seats of establishments for difference groups of streets.
How we can see on the prievues graph, groups have a not normal distibutions. We should use Mann-Whitney U-test for testing.
Null hypothesis: The average number of seats between the groups are equal.
Alternative hypothesis: The average number of seats between the groups have a significant difference.
Since we test 3 hypotheses (A/B, A/C and B/C) apply the Bonferroni correction for alpha for multiple comparisons.
Alpha level = 0.05 / 3 = 0.01666
group_A = rests[rests['group']=='A']['seats']
group_B = rests[rests['group']=='B']['seats']
group_C = rests[rests['group']=='C']['seats']
# Make the Mann-Whitney A/B U-test
alpha = 0.01666
results = st.mannwhitneyu(group_A, group_B)
print('p-value: {0:.5f}'.format(results.pvalue))
if (results.pvalue < alpha):
print('We reject the null hypothesis. The average number of seats of establishments of the groups are not equal')
else:
print("We can't reject the null hypothesis that the average number of seats of establishments of the groups are equal")
print('The difference is {0:.3%}'.format(group_B.mean()/group_A.mean()-1))
# Make the Mann-Whitney A/C U-test
alpha = 0.01666
results = st.mannwhitneyu(group_A, group_C)
print('p-value: {0:.3f}'.format(results.pvalue))
if (results.pvalue < alpha):
print('We reject the null hypothesis. The average number of seats of establishments of the groups are not equal')
else:
print("We can't reject the null hypothesis that the average number of seats of establishments of the groups are equal")
print('The difference is {0:.3%}'.format(group_C.mean()/group_A.mean()-1))
# Make the Mann-Whitney B/C U-test
alpha = 0.01666
results = st.mannwhitneyu(group_B, group_C)
print('p-value: {0:.3f}'.format(results.pvalue))
if (results.pvalue < alpha):
print('We reject the null hypothesis. The average number of seats of establishments of the groups are not equal')
else:
print("We can't reject the null hypothesis that the average number of seats of establishments of the groups are equal")
print('The difference is {0:.3%}'.format(group_C.mean()/group_B.mean()-1))
print('The mean value of A group = {:.2f}. The mean value of B group = {:.2f}. The mean value of C group = {:.2f}'.format(
group_A.mean(), group_B.mean(), group_C.mean()))
As we can see from the test results, despite the close the calculated mean values, a statistically significant difference between the mean values is present between groups A and B and also between A and C. The mean values for Group B and C can be considered equal.
As a result of our research, we found out the following:
It is worth starting a project by opening one RUN-ROBOT CAFE, but then developing a chain of establishments. Since we want to open an unusual place, we are more interested in the tourist areas and the city center. We need to strive to make our restaurants a tourist attraction.
For the first establishment recommend the following characteristics:
But for the second restaurant (after successfully working out the business model in the first restaurant), it makes sense to choose a location on the top 10 streets with up to 49 seats or up to 149 seats
Recommend that the results of this stage of the research, as well as a description of the next stage, be included in the offer for investors.